[DPE-9519]: add quorum reconfig based on cluster size#21
Open
[DPE-9519]: add quorum reconfig based on cluster size#21
Conversation
…ation on leader elected
This pull request introduces a significant refactor to how the codebase handles primary endpoint addressing, shifting from using only IP addresses to supporting both IP addresses and hostnames, depending on the deployment substrate. It updates configuration management and event handling to use the new `primary_endpoint` concept, adds hostname resolution to Sentinel management, and ensures better compatibility with non-VM substrates. Additionally, it introduces a utility function for IP validation and improves Sentinel configuration for hostname-based communication. Key changes include: **Primary Endpoint Refactor and Configuration Management:** - Refactored all configuration and service management logic in `config.py` and related event handlers to use a new `primary_endpoint` parameter (which can be a hostname or IP) instead of just `primary_ip`. This includes updating method signatures, internal logic, and how endpoints are determined based on the substrate (VM vs. Kubernetes). [[1]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL106-R106) [[2]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL140-R147) [[3]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL156-R160) [[4]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL50-R50) [[5]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL85-R88) [[6]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL97-R122) [[7]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL196-R205) [[8]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL244-R253) [[9]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL253-R268) [[10]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cR284-R292) [[11]](diffhunk://#diff-a24e7472936c291b5c95f9c56ccc25cccef3adf4c4cbd1de135e2dde31ae592cL338-R361) [[12]](diffhunk://#diff-3dfb4a104b407f7d4b2e07795a63745a09cde228a8dbe86602e7b72ea600fbe4L161-R166) [[13]](diffhunk://#diff-3dfb4a104b407f7d4b2e07795a63745a09cde228a8dbe86602e7b72ea600fbe4L209-R212) - Updated event-driven updates to always use FQDN hostnames instead of short hostnames, improving consistency and DNS compatibility. [[1]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL106-R106) [[2]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL248-R256) [[3]](diffhunk://#diff-977a7d8c04cab4f9aaa5659892eadd9b991f51cd8f5814ecb51e3162bf23e31eL290-R298) **Sentinel Management Improvements:** - Enhanced Sentinel management to resolve hostnames to IPs when necessary, using the new `is_valid_ip` helper, and improved logic for retrieving active Sentinel IPs. [[1]](diffhunk://#diff-b226109a257f8cc9cb6b0a4a1eb4c1c730d2cc9620744b6855e4f3a96ca3041dR22) [[2]](diffhunk://#diff-b226109a257f8cc9cb6b0a4a1eb4c1c730d2cc9620744b6855e4f3a96ca3041dL272-R285) - Added `resolve-hostnames` and `announce-hostnames` options to the generated Sentinel configuration to enable hostname-based communication within the cluster. **Utility Enhancements:** - Introduced a new `is_valid_ip` helper function in `common/helpers.py` to reliably check if a string is a valid IP address, supporting the above refactors and Sentinel logic. These changes collectively improve the charm's flexibility and reliability in heterogeneous environments, especially for Kubernetes and other non-VM substrates.
reneradoi
approved these changes
Mar 13, 2026
Contributor
reneradoi
left a comment
There was a problem hiding this comment.
No objections from my side. The Sentinel quorum is only used for sentinel down detection and doesn't impact failover voting, so we could even be a bit more relaxed and only update this in the start workflow and on the departure events.
There was a problem hiding this comment.
Pull request overview
This PR adds dynamic Sentinel quorum management so the configured quorum automatically adjusts to the current cluster size, and updates unit/integration tests to validate quorum behavior during scaling and lifecycle events.
Changes:
- Replaced the static Sentinel quorum constant with a computed
ConfigManager.quorumand used it in sentinel config generation. - Added Sentinel quorum read/update operations (
get_configured_quorum,set_quorum, andSentinelClient.set) and hooked automatic reconfiguration into peer relation changed/departed events. - Enhanced integration/unit tests with quorum assertions and helper utilities (plus minor integration timeout adjustments).
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
src/managers/config.py |
Computes quorum dynamically from active units and uses it in sentinel monitor config. |
src/common/client.py |
Adds SentinelClient.set() to update Sentinel config via CLI. |
src/managers/sentinel.py |
Adds methods to read and write the currently configured sentinel quorum. |
src/events/base_events.py |
Triggers quorum reconfiguration on peer relation changed/departed. |
tests/unit/test_charm.py |
Adds unit tests covering quorum update/no-update behavior; adjusts patches for new quorum reads. |
tests/unit/test_tls.py |
Updates TLS rotation unit tests to patch Sentinel quorum reads. |
tests/integration/helpers.py |
Extends CLI helper to support custom ports/JSON and adds get_quorum() helper. |
tests/integration/ha/test_scaling.py |
Adds quorum assertions across initial deploy/scale up/scale down paths. |
tests/integration/continuous_writes.py |
Increases Glide request timeout to reduce flakiness during scaling tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+323
to
+328
| return int(client.primary(self.state.bind_address)["quorum"]) | ||
|
|
||
| def set_quorum(self, quorum: int) -> None: | ||
| """Set the quorum for the sentinel cluster.""" | ||
| client = self._get_sentinel_client() | ||
| client.set(self.state.bind_address, PRIMARY_NAME, "quorum", str(quorum)) |
| def set_quorum(self, quorum: int) -> None: | ||
| """Set the quorum for the sentinel cluster.""" | ||
| client = self._get_sentinel_client() | ||
| client.set(self.state.bind_address, PRIMARY_NAME, "quorum", str(quorum)) |
Comment on lines
+488
to
+497
| def get_quorum(juju: jubilant.Juju, unit_name: str) -> int: | ||
| """Get the currently configured sentinel quorum.""" | ||
| status = juju.status() | ||
| model_info = juju.show_model() | ||
| units = status.get_units(APP_NAME) | ||
| unit_endpoint = ( | ||
| units[unit_name].public_address | ||
| if model_info.type != "kubernetes" | ||
| else units[unit_name].address | ||
| ) |
Comment on lines
+498
to
+504
| result = exec_valkey_cli( | ||
| hostname=unit_endpoint, | ||
| username=CharmUsers.SENTINEL_CHARM_ADMIN.value, | ||
| password=get_password(juju, user=CharmUsers.SENTINEL_CHARM_ADMIN), | ||
| command="SENTINEL primary primary", | ||
| port=SENTINEL_PORT, | ||
| json=True, |
Comment on lines
+547
to
+556
| if self.charm.sentinel_manager.get_configured_quorum() != self.charm.config_manager.quorum: | ||
| logger.debug("Updating sentinel quorum to match current cluster size") | ||
| try: | ||
| self.charm.sentinel_manager.set_quorum(self.charm.config_manager.quorum) | ||
| self.charm.config_manager.set_sentinel_config_properties( | ||
| self.charm.sentinel_manager.get_primary_ip() | ||
| ) | ||
| except ValkeyWorkloadCommandError as e: | ||
| logger.error(f"Failed to update sentinel quorum: {e}") | ||
| # not critical to defer here, we can wait for the next relation change or config change to try again |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This pull request introduces dynamic quorum management for the sentinel cluster, ensuring that the quorum value automatically adjusts based on the number of active units in the cluster. It also adds robust event handling to trigger quorum reconfiguration when units join or depart, and updates integration tests to verify correct quorum behavior during scaling operations.
Dynamic quorum management:
quorumproperty toConfigManagerthat calculates the required quorum based on the number of active units, replacing the staticQUORUM_NUMBERconstant. [1] [2]Sentinel quorum operations:
get_configured_quorumandset_quorummethods inSentinelManagerto retrieve and update the quorum value via the CLI, and added a genericsetmethod to the client for configuration changes. [1] [2]_reconfigure_quorum_if_necessarymethod to handle quorum updates, and integrated it with peer relation change and departure event handlers for automatic reconfiguration. [1] [2] [3]Testing enhancements:
get_quorumhelper and updated integration tests to assert correct quorum values after scaling up and down the cluster, improving test coverage for HA scenarios. [1] [2] [3] [4] [5] [6] [7]These changes collectively ensure that the sentinel cluster maintains the correct quorum configuration as the cluster size changes, improving reliability and correctness in high-availability deployments.